Deep Models: Diagnosing and Fixing a NaN Loss
1. The loss is NaN
When training a deep model, it is common for the loss to come out as NaN. For example, one training run produced the following output:

```
Epoch 1/50
  1/303 [..............................] - ETA: 0s - loss: 278.5685 - accuracy: 0.0000e+00
 47/303 [===>..........................] - ETA: 0s - loss: nan - accuracy: 0.5745
 95/303 [========>.....................] - ETA: 0s - loss: nan - accuracy: 0.5368
142/303 [=============>................] - ETA: 0s - loss: nan - accuracy: 0.5493
186/303 [=================>............] - ETA: 0s - loss: nan - accuracy: 0.5538
215/303 [====================>.........] - ETA: 0s - loss: nan - accuracy: 0.5535
248/303 [=======================>......] - ETA: 0s - loss: nan - accuracy: 0.5403
285/303 [===========================>..] - ETA: 0s - loss: nan - accuracy: 0.5439
303/303 [==============================] - 0s 1ms/step - loss: nan - accuracy: 0.5413
Epoch 2/50
  1/303 [..............................] - ETA: 0s - loss: nan - accuracy: 1.0000
 42/303 [===>..........................] - ETA: 0s - loss: nan - accuracy: 0.4762
 77/303 [======>.......................] - ETA: 0s - loss: nan - accuracy: 0.4935
112/303 [==========>...................] - ETA: 0s - loss: nan - accuracy: 0.5000
150/303 [=============>................] - ETA: 0s - loss: nan - accuracy: 0.5067
190/303 [=================>............] - ETA: 0s - loss: nan - accuracy: 0.4842
230/303 [=====================>........] - ETA: 0s - loss: nan - accuracy: 0.5261
262/303 [========================>.....] - ETA: 0s - loss: nan - accuracy: 0.5458
303/303 [==============================] - 0s 1ms/step - loss: nan - accuracy: 0.5413
Epoch 3/50
  1/303 [..............................] - ETA: 0s - loss: nan - accuracy: 0.0000e+00
 40/303 [==>...........................] - ETA: 0s - loss: nan - accuracy: 0.5250
 83/303 [=======>......................] - ETA: 0s - loss: nan - accuracy: 0.4940
128/303 [===========>..................] - ETA: 0s - loss: nan - accuracy: 0.4688
164/303 [===============>..............]
```
```
- ETA: 0s - loss: nan - accuracy: 0.5183
202/303 [===================>..........] - ETA: 0s - loss: nan - accuracy: 0.5297
247/303 [=======================>......] - ETA: 0s - loss: nan - accuracy: 0.5385
```

...great, the loss is NaN right from the start. Let's take this as an occasion to summarize the situations in which the loss becomes NaN.

2. The loss is NaN from the very start

In the example above, the loss is NaN from the first steps. This is most often a data-preprocessing problem. For the case above I ran a quick check:

```python
total_nan_values = df.isnull().sum().sum()
print('total_nan_values is: ', total_nan_values)

df = df.fillna(0)

after_total_nan_values = df.isnull().sum().sum()
print('after fillna, total_nan_values is: ', after_total_nan_values)
```

Here df is my input DataFrame. The check revealed 4 NaN values in the raw data. After the simplest, crudest fix, filling every NaN with 0, the NaN count dropped to 0 and the rest of the code ran normally.

In addition, if the input features are large in magnitude, or different features differ greatly in scale, normalize all features first.

3. The loss becomes NaN mid-training

If the loss turns into NaN partway through training, the learning rate is usually the first thing to examine.

In a classification problem, a learning rate that is too large can make the model assign probability 1 to a wrong class and probability 0 to the correct class (in practice this is floating-point underflow). The cross-entropy loss

$-y_i \log \hat{y} - (1 - y_i) \log(1 - \hat{y})$

then becomes infinite. Once that happens, differentiating the infinite loss with respect to the parameters yields NaN, and the whole network turns into NaN.

We can verify this logic with a simple snippet:

```python
import numpy as np

def loss_test():
    num = np.nan
    num = np.log(num)  # log of nan is still nan
    print(num)

loss_test()
# nan
```

If a value is NaN, taking its logarithm yields NaN as well.

To diagnose this, reduce the learning rate, or even set it to 0, and check whether the problem persists. If it disappears, the learning rate really was the culprit. If it persists, the freshly initialized network is already broken, and the implementation itself is most likely at fault.

4. Problems in the loss function definition

Sometimes the definition of the loss function itself leads to NaN, for example through log(0) or x/0. The fix is to add a small constant when defining the loss. We often see a line like this in code:

```python
eps = tf.keras.backend.epsilon()
```

That eps exists precisely to avoid such problems. The relevant declaration in the source reads:

```python
@keras_export('keras.backend.epsilon')
@dispatch.add_dispatch_support
def epsilon():
  """Returns the value of the fuzz factor used in numeric expressions.

  Returns:
    A float.

  Example:
    >>> tf.keras.backend.epsilon()
    1e-07
  """
  return _EPSILON
```

As shown, its default value is 1e-07.

5. Invalid label values

Label values must lie within the domain of the loss function. If you use a log-based loss, all label values must be non-negative. Invalid labels can likewise produce a NaN loss.
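The data-preprocessing check from Section 2 can be run end to end on a toy example; the DataFrame, its column names, and its values here are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Toy DataFrame with two deliberately missing values
# (column names f1/f2 are made up for illustration).
df = pd.DataFrame({"f1": [1.0, np.nan, 3.0],
                   "f2": [np.nan, 0.5, 0.25]})

total_nan_values = df.isnull().sum().sum()
print("total_nan_values is: ", total_nan_values)  # 2

df = df.fillna(0)  # crude but effective: replace every NaN with 0

after_total_nan_values = df.isnull().sum().sum()
print("after fillna, total_nan_values is: ", after_total_nan_values)  # 0
```

Any NaN that survives preprocessing will propagate through the forward pass and poison the loss immediately, which is why this check comes first.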
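Section 2 also suggests normalizing features whose scales differ greatly. A minimal standardization sketch (the helper name and the eps guard are my own choices, not from the original post):

```python
import numpy as np

def standardize(X, eps=1e-7):
    # Per-column zero mean / unit variance; eps guards against a
    # constant column, whose std would be 0 (and 0/0 gives nan).
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps)

# Two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
Xn = standardize(X)
print(Xn.mean(axis=0), Xn.std(axis=0))  # ~[0, 0] and ~[1, 1]
```

With both columns on the same scale, a single learning rate can suit all weights, which makes the overflow scenario of Section 3 less likely.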
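The eps trick from Section 4 can be demonstrated in plain NumPy. The hand-rolled binary cross-entropy below is an illustrative sketch, not the Keras implementation: it clips predictions away from 0 and 1 before taking the log, so a maximally confident wrong prediction yields a large but finite loss instead of infinity:

```python
import numpy as np

EPS = 1e-7  # same default value as tf.keras.backend.epsilon()

def binary_crossentropy(y_true, y_pred, eps=EPS):
    # Clip predictions into [eps, 1 - eps] so log() never sees 0.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

# Confident wrong prediction: true label 1, predicted probability 0.
with np.errstate(divide="ignore"):
    naive = -np.log(0.0)              # inf; its gradient would be nan
safe = binary_crossentropy(1.0, 0.0)  # large but finite (about 16.12)
print(naive, safe)
```

The finite value is exactly -log(eps), which is why shrinking eps trades numerical safety against loss fidelity.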